Community: add modified_since argument to O365BaseLoader #28708

Merged


MacanPN
Contributor

@MacanPN MacanPN commented Dec 13, 2024

What are we doing in this PR

We're adding an optional modified_since argument to O365BaseLoader. When set, the O365 loader only loads documents newer than the modified_since datetime.
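As a rough illustration of the filtering idea (not the code in this PR), here is a minimal, self-contained sketch. `FileRecord` and `filter_modified_since` are hypothetical names made up for this example, and the strict "newer than" comparison is an assumption based on the wording above.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Iterable, Iterator, Optional


@dataclass
class FileRecord:
    """Hypothetical stand-in for a drive item; not a type from the O365 library."""

    name: str
    modified: datetime


def filter_modified_since(
    files: Iterable[FileRecord],
    modified_since: Optional[datetime] = None,
) -> Iterator[FileRecord]:
    """Yield only files modified strictly after `modified_since`.

    With `modified_since=None`, every file is yielded, so the default
    behaves like a loader without the argument.
    """
    for f in files:
        if modified_since is None or f.modified > modified_since:
            yield f


files = [
    FileRecord("old.pdf", datetime(2024, 1, 1, tzinfo=timezone.utc)),
    FileRecord("new.pdf", datetime(2024, 12, 1, tzinfo=timezone.utc)),
]
cutoff = datetime(2024, 6, 1, tzinfo=timezone.utc)
recent = [f.name for f in filter_modified_since(files, cutoff)]
```

Both datetimes in the sketch are timezone-aware; Python raises a TypeError when comparing aware and naive datetimes, so callers would likely want to keep the two consistent.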

Why?

OneDrive and SharePoint sites can contain a large number of documents. The current approach is to download and parse all files and let the indexer deal with duplicates. This can be prohibitively time-consuming, especially when using an OCR-based parser like zerox. This argument allows skipping documents that are older than a known time of indexing.

Q: What if a file was modified during the last indexing process?
A: Users can set modified_since conservatively, and the indexer will still take care of duplicates.
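
One way to read "conservatively" is to back the cutoff off from the start time of the previous indexing run, so files touched while that run was in progress get picked up again. A small sketch under that assumption follows; `conservative_cutoff` and the one-hour default margin are hypothetical and not part of this PR.

```python
from datetime import datetime, timedelta, timezone


def conservative_cutoff(
    last_index_started_at: datetime,
    safety_margin: timedelta = timedelta(hours=1),
) -> datetime:
    """Return a cutoff slightly earlier than the previous run's start time.

    Files modified inside the margin are loaded a second time; the
    indexer is expected to deduplicate them.
    """
    return last_index_started_at - safety_margin


last_run = datetime(2024, 12, 13, 12, 0, tzinfo=timezone.utc)
cutoff = conservative_cutoff(last_run)
```

The resulting value is what one would pass as modified_since; a larger margin trades some repeated parsing for a lower chance of missing a file.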



@dosubot dosubot bot added size:M This PR changes 30-99 lines, ignoring generated files. community Related to langchain-community Ɑ: doc loader Related to document loader module (not documentation) labels Dec 13, 2024
Member

@efriis efriis left a comment


Also, if you're interested in maintaining this integration without us in the loop, we'd love to get an integration package out! Future PRs against langchain would just be {docs updates, as well as registering your package in libs/packages.yml, deprecating this community integration in favor of your integration package}

Here's the guide, and if you have questions, feel free to leave them in the comments on those pages so others can see them! https://python.langchain.com/docs/contributing/how_to/integrations/

@dosubot dosubot bot added the lgtm PR looks good. Use to confirm that a PR is ready for merging. label Dec 13, 2024
@efriis efriis enabled auto-merge (squash) December 13, 2024 17:26
@efriis efriis self-assigned this Dec 13, 2024
@efriis efriis merged commit 05ebe1e into langchain-ai:master Dec 13, 2024
19 checks passed